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ABSTRACT 

Accurately characterizing transcription factor 
(TF)-DNA affinity is a central goal of regulatory 
genomics. Although thermodynamics provides the 
most natural language for describing the continuous 
range of TF-DNA affinity, traditional motif discovery 
algorithms focus instead on classification para- 
digms that aim to discriminate 'bound' and 
'unbound' sequences. Moreover, these algorithms 
do not directly model the distribution of tags in 
ChlP-seq data. Here, we present a new algorithm 
named Thermodynamic Modeling of ChlP-seq 
(TherMos), which directly estimates a position- 
specific binding energy matrix (PSEM) from ChlP- 
seq/exo tag profiles. In cross-validation tests on 
seven genome-wide TF-DNA binding profiles, one 
of which we generated via ChlP-seq on a complex 
developing tissue, TherMos predicted quantitative 
TF-DNA binding with greater accuracy than five 
well-known algorithms. We experimentally validated 
TherMos binding energy models for Klf4 and Esrrb, 
using a novel protocol to measure PSEMs in vitro. 
Strikingly, our measurements revealed strong non- 
additivity at multiple positions within the two 
PSEMs. Among the algorithms tested, only 
TherMos was able to model the entire binding 
energy landscape of Klf4 and Esrrb. Our study 
reveals new insights into the energetics of TF-DNA 
binding in vivo and provides an accurate first-prin- 
ciples approach to binding energy inference from 
ChlP-seq and ChlP-exo data. 



INTRODUCTION 

One of the central goals of functional genomics is to 
understand how transcription factors (TFs) bind to 
specific functional elements in the genome to regulate 
gene expression. This specificity is conferred primarily by 
the intrinsic sequence preference, i.e. the binding energy 
landscape, of the DNA-binding TF. If a TF binds nucleo- 
tide sequences of length n, this landscape is defined by the 
TF-DNA binding free energy of each of the 4 n possible 
DNA n-mers. However, it is common to assume that each 
nucleotide contributes independently to the binding 
energy, and that the total interaction energy is, therefore, 
merely the sum of the n individual contributions. This is 
the so-called 'additive' model of TF-DNA binding energy 
(1). Although deviations from this additive model have 
long been noted (2,3), it is still the most widely used para- 
digm because of its simplicity. More general algorithms 
that attempt to fit non-additive models to experimental 
data could be susceptible to overfitting because of the 
large number of free parameters in such models. This is 
particularly true when the training data are subject to 
modulation by in vivo factors, such as chromatin state. 
Thus, in practice, even when non-additivity is a known 
or suspected feature of TF-DNA binding energy, it is im- 
portant to define the best possible additive approximation 
to the non-additive landscape. All widely used algorithms 
for in vivo motif discovery adopt this additive strategy, 
and so do we. 

With the aforementioned assumption, the binding 
energy landscape can be represented by a position- 
specific energy matrix (PSEM) with n columns and four 
rows — one for each of the four possible nucleotides. 
However, for historical reasons, PSEMs have rarely 
been used to represent TF-DNA binding energy. 
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Traditionally, TF-DNA binding energy models have been 
inferred from limited data sets comprising a small number 
of experimentally validated binding sites, either in the 
genome or in vitro oligomer binding assays (4). Such 
data sets are usually insufficient for quantitative estima- 
tion, and the DNA sequences are, therefore, merely clas- 
sified as 'bound' or 'unbound'. The binding energy 
landscape is modeled by a 'position weight matrix' 
(PWM), which is most commonly defined as the log-like- 
lihood (or log-odds relative to background DNA se- 
quences) of observing a specific nucleotide at a specific 
position in the bound n-mers. Many algorithms have 
been developed to infer PWM motifs from TF-DNA 
binding data using the traditional bound-unbound 
paradigm, including the widely used MEME (5) and 
Weeder (6). 

It has been shown that, when the PWM is derived from 
sites bound at low TF concentrations in vitro, it is approxi- 
mately equivalent to the PSEM (7,8). However, the pro- 
portionality of PWMs to binding energies is only 
approximate, as there is some arbitrariness in the classifi- 
cation of DNA sequences as bound or unbound. In 
reality, the occupancy (binding probability) of a TF at 
any DNA site varies continuously between zero and one. 
The bound-unbound approach requires selection of an 
arbitrary threshold for discretizing this continuously 
variable binding level. Consequently, different thresholds 
for defining bound sites would yield different PWMs for 
the same TF (8). 

With the advent of high-throughput microarray tech- 
nology, it became possible to quantify TF-DNA binding 
on a genomic scale using chromatin immunoprecipitation 
followed by array hybridization (ChlP-chip). The 
MatrixREDUCE algorithm (9) was developed to exploit 
the quantitative binding information in ChlP-chip by 
directly fitting a thermodynamic position-specific affinity 
matrix (PSAM) to the range of binding intensities 
observed in the probed genomic regions. The logarithm 
of the PSAM is equal to the negative of the PSEM (9). 
This algorithm explicitly accounts for the continuous 
nature of binding levels. However, as it was designed for 
ChlP-chip data, which typically has low resolution 
(hundreds of base pairs), the algorithm only makes use 
of the aggregate binding intensity within an entire 
genomic segment. 

Recently, ChlP-chip has been supplanted by ChlP-seq, 
which uses massively parallel sequencing instead of array 
hybridization to identify TF-bound regions genome wide 

(10) . ChlP-seq provides higher resolution (tens of base 
pairs) and more comprehensive genome-wide profiling of 
binding sites than ChlP-chip. Moreover, ChlP-seq peak 
height is well correlated with quantitative binding levels 

(11) . Although MatrixREDUCE applies equally well to 
ChlP-seq data, its aggregate-intensity approach cannot 
fully exploit the rich information content of the ChlP- 
seq tag distribution. Zhao et al. (12) have recently de- 
veloped an algorithm for inferring binding energy 
models from high-throughput sequencing-based data on 
TF-DNA binding. However, this method (BEEML) is 
only applicable to data on in vitro binding of TFs to 
short DNA fragments. We are not aware of any 



equivalent algorithms for binding energy inference 
from in vivo ChlP-seq data. To fully exploit the informa- 
tion contained in the shape of the ChlP-seq tag profile, we 
developed a PSEM estimation method named 
Thermodynamic Modeling of ChlP-seq (TherMos). 
TherMos can also be used on ChlP-exo data, which 
provide even higher spatial resolution than ChlP-seq (13). 

Through cross-validation on five ChlP-seq data sets 
from mouse embryonic stem (mES) cells (14), one newly 
generated ChlP-seq data set from mouse embryonic spinal 
cord (GEO accession number GSE43159), and one ChlP- 
exo data set from Saccharomyces cerevisiae (13), we found 
that the TherMos binding energy model has higher 
accuracy than other widely used algorithms. We further 
confirmed the high accuracy of TherMos by performing 
systematic in vitro measurements of quantitative TF-DNA 
affinity. In the course of in vitro validation, we discovered 
that both of the TFs analyzed in this manner showed 
striking deviations from the additive binding energy 
model. As a result of this non-additivity, the in vivo 
motifs detected by the traditional PWM-based algorithms 
were accurate only on high-affinity sequences. In contrast, 
the TherMos in vivo PSEM was predictive of in vitro 
binding affinity over the entire range of sequences tested. 



MATERIALS AND METHODS 

Biophysical model 

The interaction between a TF and a DNA sequence can be 
written as 



TF+D 



TF- D 



where TF is the TF, D is the DNA sequence and TF' D is 
the complex of the TF and the DNA sequence. The dis- 
sociation equilibrium constant of the reaction K d can be 
written as 



K d 



[TF\[D] 
[TFD] 



AG/RT 



(1) 



where [ ] represents concentration, A G is Gibbs free energy 
change of the reaction, R is gas constant and T is 
temperature. 

The occupancy 0(D) of any DNA sequence D by the 
TF is defined as the probability of binding or fraction 
bound for that sequence, and it can be written as 
(8,9,12) 



0(D) 



[TF ■ D] 
[TF ■ D] + [D] 
1 



1 

\+K d /[TF]~Y 



1 

l 

[TF\/K, 



(2) 



1 



l 

[Tf]/K d (ref)exp(.-AAG/RT) 



where K d (rej) is the dissociation equilibrium constant for 
the reaction between the TF and the reference sequence. 
Hence, 



AAG= AG- AG(ref) 



(3) 
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Position-specific energy matrix 

For an n-mer motif, a PSEM is denned as a 4 x n matrix 
with the rows 1-4 correspond to A, T, G, C, respectively. 
Each element of the PSEM contains relative DNA 
protein binding free energy change to the reference 
sequence (in unit of RT), AAG/RT [Equation (3)], 
directly related to the actual discrimination energies in 
physical units. Therefore, only three elements in each 
column of the PSEM are independent (15,16). For con- 
venience, the elements in the last row of the PSEM are set 
to zeros. Smaller AAG/RT means stronger binding 
affinity compared with the reference sequence. The inde- 
pendence among the positions in the target sequence is 
assumed (17). Consequently, AAG is additive across pos- 
itions in the binding site. 

ChlP-seq data sets 

The in vivo ChlP-seq data were derived from chromatin 
extracted from the dorsal domain of the dissected spinal 
cords of embryonic day 12.5 (E12.5) mouse embryos. 
Immunoprecipitation was performed using a monoclonal 
Mashl/Ascll antibody as described in Castro et al. (18). 
ChlP-seq data for the five mES cell TFs (Esrrb, Klf4, 
Stat3, Zfx and n-Myc) are from Chen et al. (14). Yeast 
Rebl ChlP-exo data are from Rhee and Pugh (13). 

Peak calling 

TherMos takes a set of externally generated peak calls as 
input. Here, we used the MACS algorithm with default 
settings and a P-value threshold of 1 x 10~ 10 to call peaks 
(19). The control library from Chen et al. (14) for ChlP- 
seq data set in mES cells is used. No control library is 
available for Mashl and Rebl data set. The numbers of 
peak calls are 31 621, 7508, 1380, 9678, 5320 and 2521 for 
Esrrb, Klf4, Stat3, Zfx, n-Myc and Mashl, respectively. 
For ChlP-exo of Rebl, 1776 peaks called in Rhee and 
Pugh (13) were used. To save computational time, one- 
third of the Esrrb peak calls (10 540) were randomly 
chosen as input for TherMos and other algorithms. 

TherMos implementation 

In TherMos, we assume that binding affinities are only 
determined by the interaction between TFs and DNA se- 
quences. Effects of chromatin status, competing nucleo- 
somes and cooperative or competing TFs are neglected. 
The free parameters optimized in TherMos are the 3 x n 
elements in the PSEM for an n-mer motif (n is user- 
specified), plus the scaled TF concentration parameter 
[TF\/K d (ref) [Equation (2)] (15). Input of TherMos 
include a set of ChlP-seq tag coordinates, a set of 
control-library tag coordinates and a set of externally 
generated peak calls. TherMos is designed to exploit the 
information from ChlP-seq peak shape within peak calls. 

Based on the GC content of the control library, 
TherMos first performs GC bias correction on the ChlP- 
seq tag counts (Figure 1A and Supplementary 
Information). Second, a smoothing weight, i.e. the 
average shape of the tag distribution at binding sites, is 
derived for the forward and reverse GC-corrected tags 



(Figure IB and Supplementary Information). Within 
1-kb regions centered on peak calls, the reverse or 
forward GC-corrected tag counts are smoothed using 
the forward or reverse smoothing weights and added to 
generate a ChlP-seq profile (Y obs ) (Figure 1C). 

Y obs shows some background noise because of 
randomly generated tags regardless of the specific 
antibody. A noise level is estimated as the average tag 
counts of the ChlP-seq library in the whole genome and 
subtracted from Y obs . High ChlP-seq peaks are observed 
because of the low complexity DNA (20) both in the 
ChlP-seq library and the control library. To eliminate 
these false-positive peaks, tag densities of the control 
library within 1-kb region centered on the peak calls are 
calculated and regions with tag densities >6x inter 
quartile range are removed. 

Next, TherMos generates an initial guess of PSEM as a 
starting point of the optimization routine (Figure ID). 
Each possible n-mer is given a score according to the 
average height of Y obs where the n-mer occurs. The 
n-mer, which has the highest z-score, is chosen as the con- 
sensus sequence. The initial guess of PSEM is then 
constructed from the consensus sequence and its all 
singly mutated variants based on their average Y obs . 

Then TherMos starts the optimization routine by 
scanning the PSEM along the genomic sequence to calcu- 
late thermodynamic occupancy estimates at each position 
within 1 kb regions centered on peak calls [Figure IE and 
Equation (2)]. An expected peak shape is obtained by 
convolving the forward and reverse smoothing weights. 
A predicted ChlP-seq profile (Y pred ) is generated by 
smoothing the estimated occupancy profile using the 
expected peak shape. Smoothing the occupancy of each 
n-mer with the expected peak shape is equivalent to two- 
step smoothing. The occupancy is firstly smoothed with 
the reverse smoothing weight to generate positions of the 
reverse tags relative to the position of the occupancy 
(position of the n-mer). Second, equivalent to the con- 
struction of Y obs , smoothing the aforementioned predicted 
reverse tags profiles with the forward smoothing weight to 
generate the Y pred . Both forward and reverse strands 
are taken into account in the occupancy calculation. To 
allow for comparison, Y pred in each 1-kb binding region 
is scaled according to Y obs in the same region. Then 
the Levenberg-Marquardt algorithm (21,22) is used to 
minimize the squared prediction error ||Y obs — Y pred || 2 
(SPE) covering all the 1-kb binding regions by iteratively 
updating the PSEM. Finally, TherMos outputs a PSEM 
when the SPE is smaller than a threshold (Figure IF). 
The TherMos program can be downloaded from http:// 
collaborations. gis.a-star.edu.sg/~cmb6/TherMos. 

Cross-validation 

To evaluate the performance of TherMos, a 10-fold cross- 
validation test was performed for TherMos and other five 
motif discovery algorithms. Settings and details of running 
these five algorithms can be found in the Supplementary 
Information. PWMs predicted by Weeder, MEME, 
DREME (23) and ChlPMunk (24) were converted to 
PSEMs as described previously (7). In each round of the 
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Figure 1. Workflow of the TherMos algorithm. (A) Enrichment of tags in the control library as a function of local GC content. (B) Forward and 
reverse smoothing weights (tag distribution at binding sites) estimated using an iterative peak refinement procedure. Most tag fragments are < 100 bp 
from the binding site. (C) Construction of ChlP-seq profile (Y obs ) from per base pair tag profile. (D) Initial guess of PSEM. (E) Theoretically 
predicted ChlP-seq profile (Y pre j), based on PSEM and inferred average peak shape. (F) PSEM that best fits the Y obs . 



test, 90% of the ChlP-seq data were randomly chosen as 
the training set and the remaining 10% were used as the test 
set. Another parameter [TF\jK c i(ref) [Equation (2)] was 
optimized using the training set for the algorithms except 
TherMos, respectively. The SPE (||Y obs - Y pred || 2 ) and the 
rank correlation coefficient between the total occupancy 
and the total tag counts within 1-kb region centered on 
the peak calls in the test set were calculated. To compare 
the cross-validation results for each algorithm, the average 
SPE and rank correlation coefficient from all of the folds 
for each TF were calculated, respectively. 

Model for deriving K d and PSEM from competitive 
electrophoretic mobility shift assays 

In an competitive electrophoretic mobility shift assay 
(EMSA), labeled and unlabeled DNA sequences 
compete with each other to bind TFs. 



TF+D L 



TF+D, 



TFD L 
TF ■ Djj 



where D L and D v are labeled and unlabeled DNAs, re- 
spectively. The dissociation equilibrium constants for the 
two reactions are 



K c i\ 



Kd2 — 



[TF][D L ] 
[TF ■ Di] 

[TF][Du] 
[TF - Dy] 



(4) 



(5) 



The total concentrations of labeled DNAs, unlabeled 
DNAs and TFs, i.e. [D L ]„ [fly], and [TF], are known. 



[D L ], = [D L ] + [TF-D L ] 



(6) 
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[D u ] t = [D v ] + [TF.D v ] 

[TF], = [TF] + [TF ■ D L ] + [TF ■ D v ] 



(7) 
(8) 



The fraction bound / measured from the competition 
assay is defined as 



/ 



[TF ■ D L ] 



[D L ] t 

Based on Equations (4-7), 

K d2 = [D L/ ][TF D L ] = [D v ] [D L ], - [D L ] 
K dl [TF-D V ][D L ] [D v ] t - [D v ] [D L ] 



(9) 



(10) 



If the unlabeled competitor DNA sequence is the same 
as the labeled consensus DNA sequence, we assume that 
Kn = Kjz. Then, 



Kfl 



[Drf [D L \, - [D L ]' 



[D LI ]' - [Da]' 



1 



(11) 



The quantities with prime are used here to indicate this 
special case. From Equation (11), we get 

[Du]' = [Du]' r (l-f) (12) 
Combine Equations (7-9) and (12), we get 

[TF]' = [TF], -f[D L \ -f[D v l (13) 
Then from Equations (4, 6, 9 and 13), K d \ can be written 



as 



Kdi 



[TFf[DJ [TF]'(l -/) 



[TF-D L Y 

1-/ 



(14) 



([TF\, -f[D L ], -f[Dvl) 



As not all the proteins are active in the solution, the 
active total protein concentration needs to be determined 
before K c /i and Kdi can be calculated. The lower limit of 
the active total protein concentration can be determined 
from Equation (14) and K c/l > 0. That is, 

[TF] t> f([D L ] t + [Du]' t ) (15) 

If K rl i ^Kjj, based on Equations (6, 7) and (9, 10), 
Kdi f Wu] f [D v ] t -[TF-Du] 



Kdi \-j [TFD LI \ l-f [TF - Du] 

i-j \tf-Du] 

From Equations (4, 8 and 9), 



(16) 



[TF-D v ] = [TF] t -K d 



ADl\ 



1-/ 

Combine Equations (14, 16 and 17), 
Kii = J_ [Du] t 



(17) 



1-/ 



J/ i-f 



(18) 



As Ktn/Kji > 0, so 

f - f fd - f) f(f- f) 

;[TF], + J \ J J [Du]' t + J -^/ [D L ], > 0 (19) 



/•(I-./) 



1-/ 



1-/ 



For those unlabeled competitors with f>f, the upper 
limit of the active total protein concentration can be 
determined as 



[TF], < 



f-f ■ (1 -/) 

./-/ 



Wul+f-f -WlI 



(20) 



Once the upper and lower limits of the active total 
protein concentration are determined, we take the mean 
of the limits as the active total protein concentration. Then 
Kd\ and Kd2 can be calculated using Equations (14 and 18) 
accordingly. Finally, the PSEM can be obtained based on 
the relationship 



AAG , Kdi 

= In — 

RT Kdi 



RESULTS 



(21) 



Overview of the TherMos approach 

TherMos infers an additive binding energy model (PSEM) 
using least-squares fitting to the ChlP-seq tag profile 
(Figure 1 and 'Materials and Methods' section). The al- 
gorithm takes as input a set of ChlP-seq tag coordinates, a 
set of control-library tag coordinates and a set of exter- 
nally generated peak calls. First, TherMos performs GC 
bias correction on the ChlP-seq tag counts and infers the 
average shape of the tag distribution at binding sites (1-kb 
regions centered on peak calls, Supplementary 
Information). The GC-corrected per base pair (un- 
binned) tag counts are then smoothed using the tag distri- 
bution to generate the ChlP-seq profile (Y obs ). TherMos 
starts the optimization routine by using a heuristic to 
generate an approximate PSEM from Y obs . This PSEM 
is scanned along the genomic sequence to calculate 
thermodynamic occupancy estimates at each position 
within 1-kb regions centered on peak calls ('Materials 
and Methods' section). The estimated occupancy profile 
is converted into a predicted ChlP-seq profile (Y pred ) using 
the appropriate tag-distribution-based peak shape 
('Materials and Methods' section). The squared prediction 
error ||Y obs — Y pred || 2 is then minimized by iteratively 
updating the PSEM in successive rounds of optimization. 
The output of the TherMos algorithm is a PSEM that 
models the free energy of TF-DNA binding and can 
readily be expressed as a sequence logo (1,25). 

TherMos outperforms other algorithms in 
cross-validation tests 

We used TherMos to derive PSEMs for six TFs spanning 
a broad range of DNA-binding domains, based on 
ChlP-seq data from mES cells (Esrrb, Klf4, Stat3, Zfx 
and n-Myc) (14) and ChlP-exo data from S. cerevisiae 
(Rebl) (13). In addition, to evaluate TherMos on data 
from a heterogeneous tissue, we generated and analyzed 
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Mashl ChlP-seq data from mouse spinal cord at embry- 
onic day 12.5 ('Materials and Methods' section). Only the 
dorsal region of the spinal cord was analyzed, as Mashl 
expression is restricted to the dorsal domain at this time 
point (26). As seen in Figure 2A, the Mashl ChlP-seq 
profile showed strong peaks at Fbxw7 and Dill, two 
known targets of the TF (18). 

To visualize the results, the seven PSEMs were con- 
verted into sequence logos (Supplementary Figure SI). 
Reassuringly, the motifs discovered by TherMos quali- 
tatively match the known motifs for corresponding 
factors (TRANSFAC database) (27). For comparison, 
we also inferred binding affinity matrices using 
MatrixREDUCE, and PWMs using four well-known al- 
gorithms: Weeder, MEME, DREME and ChlPMunk. 
The accuracy of the inferred PSEMs, PSAMs and 
PWMs was quantified through 10-fold cross-validation 
analysis using two different performance metrics. The ob- 
jective was to determine whether the methods, when 
trained on 90% of the ChlP-seq peaks, could predict the 
ChlP-seq profile in 1-kb regions centered on the remaining 
10% (test set). We first measured accuracy as the SPE 
between the predicted and measured ChlP-seq profiles in 
the test set (Figure 2B and 'Materials and Methods' 
section). A smaller SPE indicates higher accuracy. By 
this metric, TherMos ranked first on all but one of the 
seven data sets, and it had the smallest average SPE. 
ChlPMunk had the second-lowest average SPE, and the 
remaining four had similar average SPE to each other. 

As a second method to evaluate the performance of 
TherMos, we computed the rank correlation coefficient 
between the total predicted TF occupancy and the total 
tag count in the test set of 1-kb peak regions (Figure 2C). 
Note that the rank correlation metric is precisely the 
quality measure optimized by MatrixREDUCE, and this 
could potentially give the algorithm an advantage in this 
particular comparison. However, TherMos was once 
again the most accurate overall, and ChlPMunk again 
ranked second (though by a narrower margin). Sequence 
logos and box-plots illustrating the range of SPE and 
rank correlation values for each TF and each algorithm 
are presented in Supplementary Figures S2-S7. In 
summary, the PSEMs inferred by TherMos show the 
highest overall accuracy in predicting ChlP-seq and 
ChlP-exo profiles. 

TherMos accurately predicts Esrrb in vitro binding energy 

To experimentally benchmark the performance of 
TherMos in predicting the intrinsic binding energy of 
TFs, we developed a competitive EMSA protocol that 
can measure PSEMs in vitro ('Materials and Methods' 
section and Supplementary Information). We first 
applied this validation approach to the nuclear receptor 
Esrrb. As in the standard EMSA competition assay, we 
mixed a labeled high-affinity DNA fragment with the 
purified Esrrb DNA-binding domain and multiple un- 
labeled competitor DNA fragments, and then quantified 
the fraction of labeled DNA fragments that bound Esrrb. 
The bound fractions were then used to infer the dissoci- 
ation constants of TF binding to the competitor fragments 



('Materials and Methods' section). Using the 9-bp Esrrb 
'consensus' element CCAAGGTCA as the core of the 
labeled fragment, we tested 28 competitors: the consensus 
sequence itself, plus all 27 (3 x 9) singly mutated variants 
of the consensus (Figure 3B). From the resulting bound- 
fraction data, we estimated an additive in vitro PSEM 
for Esrrb. The equivalent sequence logo is shown in 
Figure 3C. 

To use the EMSA-generated in vitro PSEM for Esrrb as 
a benchmark, we require a measure of how different the 
benchmark PSEM is from the PSEMs and log-odds 
PWMs predicted by the six algorithms. For this 
purpose, we transformed all binding energy and affinity 
models into their equivalent nucleotide frequency matrix, 
and then we used Euclidean distance between the experi- 
mentally and computationally derived matrices as a 
measure of prediction error (Supplementary 
Information). We adopted this approach because 
Euclidean distance was found to be the best performer 
in a systematic assessment of seven different distance 
measures for TF-binding motifs (28). By this measure, 
the binding energy model measured in vitro for Esrrb 
was closest to the ChlPMunk prediction (Figure 3D). 
TherMos ranked second in prediction accuracy, followed 
by MEME, DREME, Weeder and MatrixREDUCE. 

On visual inspection, we noticed systematic differences 
between the thermodynamic methods (TherMos and 
MatrixREDUCE) and the dichotomous 'bound- 
unbound' methods (Weeder, MEME, DREME and 
ChlPMunk). The dichotomous algorithms predicted a 
strict 'AA' sequence at positions 3 and 4 in the Esrrb- 
bound n-mer (Figure 3A), whereas the two thermo- 
dynamic methods were relatively tolerant of mismatches 
at those two positions. To quantify this effect, 
we examined the per position Euclidean distance of 
the six algorithms (Figure 4A). In comparison with 
the dichotomous methods, the biophysical methods 
deviated strongly from the in vitro PSEM at positions 
3 and 4. 

The local discrepancy in Esrrb binding energy models 
has an intriguing parallel in protein-binding microarray 
(PBM) measurements of TF-DNA binding energy (29). 
According to the PBM data, the binding energy landscape 
of Esrra, a close paralog of Esrrb, could not be modeled 
by a single additive PWM. A better fit was obtained by 
combining two PWMs that differed strongly at positions 
2-4 but were almost identical elsewhere (Figure 4B) (29). 
Thus, Esrra (and presumably also Esrrb) DNA-binding 
energy shows position interdependence at precisely the 
location where TherMos and MatrixREDUCE differed 
from enrichment-based PWM models. The primary 
Esrra motif strongly resembled the in vitro EMSA 
binding energy model for Esrrb and also the models of 
the four enrichment-based algorithms. In contrast, the 
TherMos model presumably represents an additive ap- 
proximation to the non-additive binding energy contribu- 
tions of positions 2-A. The same could be said for 
MatrixREDUCE, except for a scaling of the information 
content relative to TherMos. 

The non-additive Esrra/b binding energy model 
inferred by Badis et al. (29) from PBMs provides a 
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qualitative explanation for the observations noted earlier 
in the text. First, the similarity of the in vitro EMSA 
binding energy model to the 'primary motif of Badis 
et al. is not surprising when one considers that both 
were inferred from single mutations relative to the con- 
sensus n-mer. As double mutations were not considered, 
these two motifs represent the additive binding energy 
landscape in the immediate vicinity of the consensus 
binding n-mer. The EMSA model and the PBM-derived 
primary motif are, by definition, 'blind' to non-additivity. 
In contrast, TherMos and MatrixREDUCE fit quantita- 
tive binding models to a large array of genomic binding 
sites, many of which contain multiple mutations relative 



to the consensus. We would, therefore, expect these two 
algorithms to estimate binding energy models that repre- 
sent an additive approximation to the entire binding 
energy landscape. As the primary and secondary PBM 
motifs differ mainly at positions it is thus not 

surprising that TherMos and MatrixREDUCE show 
higher tolerance for sequence variability at those motif 
positions. The behavior of the statistical algorithms 
(Weeder, MEME, DREME and ChlPMunk) was not 
predictable a priori, but it is possible that the bound- 
unbound paradigm used by such methods might cause 
them to systematically converge on the higher-affinity 
primary motif. 
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Esrrb multi-mutations deviate from additive model: 
confirmation using EMSA 

To independently confirm the prediction that position- 
interdependence affects Esrrb— DNA binding energy, we 
used EMSA to measure the affinity of two multiply 
mutated versions of the Esrrb consensus binding site. 
These two oligomers were designed to match the second- 
ary Esrra motif inferred by Badis et al. (29) from PBM 
data. Our hypothesis was, therefore, that they would bind 
Esrrb with significantly greater affinity than predicted 
from the single-mutation EMSA measurements 
(Figure 4C). Indeed, we found that both sequences 
bound Esrrb more strongly than expected. The sequence 
TAGGGGTCA, which exactly matched the secondary 
Esrra motif from PBM measurements, showed the most 
dramatic deviation; it bound Esrrb with 100-fold greater 
affinity than predicted from the single mutations. In 
general, non-specific binding at the extreme low end of 
the affinity spectrum could be one potential cause of de- 
viations from the additive binding energy model. 
However, this is an unlikely explanation for our results, 



as the two multi-mutated oligomers are well within the 
affinity range of singly mutated sequences. In fact, the 
affinity of the TAGGGGTCA sequence was 29-fold 
higher than that of the weakest measured binder (CCAA 
GCTCA), indicating that non-specific binding is not likely 
to be a factor. Thus, these results validate our hypothesis 
that the relatively low specificity of the TherMos binding 
energy model and MatrixREDUCE affinity model for 
Esrrb at positions 2-4 are a consequence of non-additive 
binding energy contributions at those positions. 

TherMos accurately predicts Klf4 in vitro binding energy 

For further validation of the performance of TherMos in 
predicting the intrinsic binding energy of TFs, we per- 
formed a similar analysis of the C 2 H 2 zinc-finger protein 
Klf4 binding energy (Figure 5). In this case, the motif 
inferred from ChlP-seq data by ChlPMunk showed the 
closest overall resemblance to the in vitro EMSA PSEM, 
and TherMos ranked third. Interestingly, we again found 
that the binding site contained a sub-region (positions 8 
and 9) where the biophysical methods (TherMos and 
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MatrixREDUCE) tolerated mismatches to the consensus, 
whereas the statistical methods inferred a fairly strict 
requirement for the consensus dinucleotide (GG) 
(Figure 5A). This dichotomy was again quantitatively sup- 
ported by examining the Euclidean distance at each nu- 
cleotide position (Figure 6A). The discrepancy between 
TherMos and EMSA was also highest in this region (pos- 
itions 7-10). Encouragingly, we again found a parallel in 
the PBM data, which support two different binding energy 
matrices for Klf7, a close paralog of Klf4 (Figure 6B). As 
before, the two Klf7 PBM motifs differed most noticeably 
at exactly the positions where TherMos differed from 
EMSA (positions 7-10). Given the consistent localized 
discrepancies between the biophysical algorithms and 
our in vitro EMSA models, we hypothesized as before 



that the EMSA approach of measuring only singly 
mutated DNA sequences had failed to capture the full 
breadth of the binding energy landscape of Klf4. 

TherMos in vivo PSEM predicts binding energy landscape 
of Klf4 multi-mutations in vitro 

To independently confirm the non-additivity of nucleotide 
interaction energies in the Klf4 binding site, we designed a 
second round of in vitro affinity measurements focused on 
multiple mutations at interdependent positions within the 
Klf4 binding site. We also hypothesized that, in this test, 
TherMos would outperform the binding energy models 
that resembled the primary Klf7/4 PBM motif. 
Combinatorial mutations at positions 5, 7, 8, 9 and 10 
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were designed based on significant differences between the 
two PBM models at these positions (Figure 6). We used 
EMSA as before to measure the in vitro affinity of 25 
multiply mutated DNA fragments (Figure 6C). 

Strikingly, the additive binding energy model based on 
single-mutation in vitro measurements failed almost com- 
pletely in predicting the binding energy of multiply 
mutated Klf4 binding sites (Pearson correlation coeffi- 
cient: R = 0.19) (Figure 6D). This result strongly 
supports binding energy interdependence at positions 5, 
7, 8, 9 and 10 and highlights the inadequacy of the 
binding energy model inferred from the single-mutation 
in vitro assay. Given the poor performance of the single- 
mutation in vitro motif for Klf4, we hypothesized that the 
primary PBM-derived motif would also perform poorly in 
this multi-mutation binding energy prediction test. Indeed, 
we found that this was the case (R = 0.08). In contrast, the 
secondary PBM-derived motif displayed good predictive 
power (R = 0.66), indicating that it represents a reason- 
able additive approximation to the binding energy land- 
scape far from the consensus binding n-mer. 

We evaluated the ability of the various in vivo Klf4 
binding models inferred from ChlP-seq data to predict 
the in vitro binding energy of the multi-mutated DNA se- 
quences. In this test, TherMos was the only algorithm that 
provided accurate binding energy predictions (R = 0.6). 
The other five algorithms were mostly unable to predict 
multi-mutation affinities (R<0.1), despite their accuracy 
in the previous single-mutation benchmark. Thus, the 
other algorithms fail to capture the binding energies of 
sites that deviate significantly from the high-affinity con- 
sensus at interdependent positions. As the TherMos model 
is based on the entire binding energy landscape of in vivo 
binding sites, it is able to provide a more accurate additive 
approximation to the in vitro binding energy even in the 
presence of position interdependence. 



DISCUSSION 

The occupancy of a TF at any given genomic binding site 
is related to its intrinsic binding energy, which can most 
naturally be described using the language of thermo- 
dynamics. Viewed from a thermodynamic perspective, 
every DNA n-mer has some non-zero likelihood of being 
bound by a given TF, with this likelihood (i.e. occupancy) 
being a continuous function of the binding free energy of 
the n-mer. However, before the advent of high-throughput 
methods, this continuum of TF-DNA binding energy was 
commonly discretized into two categories: bound and 
unbound. The bound-unbound dichotomy is artificial, 
and also somewhat arbitrary, as it reflects the detection 
limits of specific biochemical binding assays, rather than 
any inherent bimodality in TF-DNA binding levels. 
Nevertheless, it was unavoidable in most cases because 
of the limitations of traditional forms of the training 
data. Consequently, bioinformatic methods typically 
eschewed explicit thermodynamic modeling and instead 
favored a machine-learning approach based on motif en- 
richment in bound sequences relative to unbound 
sequences. 



Now that multiple high-throughput methods exist for 
generating quantitative binding profiles, it is possible to 
adopt a more natural thermodynamic formalism for motif 
detection, based on a continuum of free energy and occu- 
pancy. Continuous-occupancy binding energy models are 
already incorporated in some modern motif detection 
algorithms (9,12). However, using a thermodynamic 
approach in conjunction with ChlP-seq data on in vivo 
TF binding requires additional effort. Most importantly, 
TherMos has the ability to predict the precise shape of the 
ChlP-seq profile implied by any particular binding energy 
model. This feature is key to the ability of TherMos to 
exploit the rich information content of the ChlP-seq tag 
distribution, and likely contributes to the robustness of the 
algorithm. 

Uniquely, TherMos fits to the peak shape in binding 
regions, rather than to absolute peak height (see 
'Materials and Methods' section). This largely insulates 
the algorithm from locus-specific scaling of TF binding 
levels by the local chromatin state. As MatrixREDUCE 
in effect predicts peak height rather than peak shape, it is 
more susceptible to distortions from chromatin state vari- 
ation; therefore, it infers binding energy models of rela- 
tively low information content (Figures 3 and 5). When 
the locus-specific scaling feature is disabled, TherMos 
similarly produces PSEMs with low information content, 
although the effect is not as pronounced (data not shown). 

In this study, we used a two-pronged approach to com- 
prehensively validate the TF-DNA binding energy and 
affinity models inferred from ChlP-seq and ChlP-exo 
data by TherMos and other algorithms. The validation 
strategy included both in vivo cross-validation and 
in vitro EMSA assays for quantifying dissociation con- 
stants. Overall, these analyses indicated that the binding 
energy models estimated by TherMos provided the most 
accurate representation of the entire binding energy land- 
scape. This was particularly true when the non-additivity 
of interaction energies across neighboring nucleotides in 
Esrrb and Klf4 binding sites was taken into account. 

Recent studies based on high-throughput in vitro TF- 
DNA affinity measurements suggest that additive models 
of TF-DNA binding energy are generally effective, and 
only occasionally violated (30-32). In particular, it was 
found that non-additivity was prevalent in vitro among 
TFs from the zinc-finger and zipper classes (31). Our ex- 
periments were performed before publication of these 
studies. However, coincidentally, the two TFs we 
selected for non-additive binding energy analysis happen 
belong to the zinc-finger class. 

For both Esrrb and Klf4, our in vitro dissociation 
constant measurements provide strong evidence for non- 
additivity in the binding energy landscape. Strikingly, the 
TAGGGGTCA DNA oligomer, which was predicted to 
have negligible affinity for Esrrb based on the additive 
model, displayed 100-fold greater affinity than expected. 
This is highly consistent with the PBM-based prediction 
of Badis et al. that Esrra, a close paralog, can bind a sec- 
ondary DNA motif whose consensus sequence is AGGG 
GTCA. We performed a more systematic survey of non- 
additivity for Klf4, by measuring the in vitro affinity of 25 
multiply mutated versions of the consensus binding site. 
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We designed these mutations to coincide with positions 
that differed between the primary and secondary Klf7 
motifs inferred by Badis et al. from PBM data. 
Remarkably, the binding energy of Klf4 for the multiply 
mutated sequences bore almost no resemblance to the 
additive PSEM inferred from our single-mutation 
EMSA assays. Rather, they closely matched the secondary 
Klf7 PBM motif. Thus, our EMSA measurements inde- 
pendently confirm the PBM-based predictions of non- 
additive binding for Klf4/Klf7 and Esrra/Esrrb. 

Little is known about the in vivo significance of non- 
additive binding. It is conceivable that the non-additivity 
observed in vitro may only affect DNA sequences whose 
affinity is too low to have any effect on genomic binding. 
However, the TAGGGGTCA Esrrb-binding sequence 
does not fit this pattern; its affinity is within a factor 
of 50 of the consensus n-mer, and in fact higher than 
that of nine singly mutated versions of the consensus. 
The TherMos PSEMs inferred from ChlP-seq data 
provide even more direct evidence for the in vivo import- 
ance of non-additivity. Consider a hypothetical scenario 
in which the additive model is dominant in vivo and suf- 
ficient to explain genomic TF binding. In such a 
scenario, the TherMos PSEM would correlate with the 
primary PBM motif of Badis et al., but not with the 
secondary motif. However, we see that the TherMos 
PSEM diverges significantly from the primary motif, 
and this divergence occurs precisely at the nucleotide 
positions where the two PBM models diverge. 
Similarly, if additive binding were dominant in vivo, the 
TherMos PSEM would fail to predict the binding energy 
of Klf4 binding sites that violated the additive model 
in vitro. However, TherMos can indeed predict the 
binding energy of such sequences (Figure 6D). Thus, 
our results consistently reflect the influence of non- 
additivity on TF binding in vivo. 
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